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Abstract 

In this paper we parameterize non-negative matrices of sum one and 
rank at most two. More precisely, we give a family of parameterizations 
using the least possible number of parameters. We also show how 
these parameterizations relate to a class of statistical models, known 
in Probability and Statistics as mixture models for contingency tables. 
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1 Introduction 

The study of non-negative matrices with fixed rank has recently attracted a 
great deal of work both theoretical and applied. One of the main problems in 
this field is the so-called "non-negative matrix factorization problem" , which 
can be shortly stated as follows. Given a non-negative matrix A € M I ^ xJ 
(where K+ denotes the set of real non-negative numbers), one has to find 
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an approximation of A as a linear combination of k dyadic products qr-, 
where the Cj's and r^s are vectors with non-negative entries, i.e. c% G R+ 
and rj G R £ + . 

The rank of a matrix gives the numbers of rank one matrices, i.e. dyadic 
products, needed to write the matrix as a sum of dyads. But there are no 
non- negative conditions on the vectors of the dyads. The non-negativity 
constraints make the situation more comple x and one has to work with t he 
non-negative rank of the matrix (see e.g. Cohen and Rothblum ( 19931 )). 
which is in general bigger than the ordinary rank. Therefore, it is not 
possible in general to decompose a rank k matrix into the sum of exactly 
k dyadic products Qr* where q and j*j are non-negative vectors. We will 
review the main results about non-negative rank in the next section. 

In recent literature, a number of results and a lgorithms for non-nega tive 
matrix factor i zation have been published, see e.g. (Lee and Seun3 fcood ). In 
Catral et all (120041) special t e chniq ues for symmetric tables are presented, 
while in Ho and Van Dooren ( 20081) the case of fixed row and column sums 



i s ana lyzed, with applications to stochastic matrices. In lFinesso and Spreij 



(|200fih . ' he authors discuss some connections between the factorization prob- 
lem an d the notion of /-divergence, which has a well known s tatistical role, 
see e.g. bacunha-Castelle and Duflol (jl98fih and lPardol (j2005h . 

From the point of view of Probability, non-negative matrices are a nat- 
ural tool in the analysis of two-way contingency tables. A two-way contin- 
gency table A = (a^,) collects data from two categorical random variables 
measured on n subjects. Let us suppose that the first variable X has I levels 
1, . . . , I and the second variable Y has J levels 1, . . . , J. The element aij 
is the count of subjects with X = i and Y = j. Therefore, A is an / x J 
matrix with non-negative integer entries. 

A joint probability distribution for the pair (X, Y) is a probability matrix 
with / rows and J columns P = {pij) of non-negative real numbers such 
that J2ijPij = 1- A statistical model M. for / x J contingency tables is a 
set of probability distributions, i.e. a subset of the simplex 



P = (Pi,j) ■ Pi,j > °) ^2Pi,j 



1> C 



One of the most widely used models fo r two-w ay contingency tables is the 
independence model, see e.g. lAgrestJ (|2002h . It is defined through the 
vanishing of all 2 x 2 minors of the generic matrix, i.e. by the equations 



Pi,jPl,h ~ Pi,hPl,j = 1 <i < I < I, 1 <j < h < J; 
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thus, the points of the independence model are rank 1 matrices. 

Recent developments in Statistics have shown the relevance of probabil- 
ity models whose points are matrices of rank at most 2. One example in 
this direction, based on a s pecial symmetric matrix, is the so-called "100 



Swiss francs problem", see Sturmfelsl (2008). This problem comes from 



Computation al Biology, where it is u s eful t o analyze the alignment of DNA 



sequences, see lPachter and Sturm fels (2005). Although this particular prob- 
lem has been solved in Gao et al. ( 20081 ). the study of fixed-rank probability 
matrices is mainly unexplored. 

As the sum of k matrices with rank 1 has rank at most k, the matrices 
which can be written as the sum of k dyadic products encode the notion of 
mixture of k distributions from independence models. 

In Probability and Statistics it is interesting not only to study the ap- 
proximation problem mentioned above, but also to have a parametrization 
of t he mode l s. W hile for rank 1 matrices the parametrization is easy, see 
Agrestil hooi ). the problem bec omes difficult in the ca se of higher non- 



e.g 



negative ranks. Already for k = 2, in iFienberg et~al\ (|2O10h it is shown that 



the model is not identifiable, meaning that different parameter values lead 
to the same probability distribution. 

This issue is a well known problem in statist i cal mo dellin g called "param- 



eter r edundancy", see iCatchpole and Morgan ( 19971 ) and Catchpole et al. 
(1998). The detection of parameter redundancy has a major relevance in 
maximum likelihood estimation, where the parameters of a statistical mod- 
els are estimated through the m aximizat i on of a real-valued function called 
"likelihood function", see e.g. lAgrestil (12002^ . In the papers mentioned 
above, the authors propose a purely analytical technique to detect the pa- 
rameter redundancy of a statistical model, by computing the rank of the 
Jacobian matrix of a specific function. The redundancy is checked through 
Symbolic Algebra computations and the problem of redundancy is overcome 
via additional linear constraints on the parameters. 

In this paper, we propose a method which uses linear algebra to make 
the maximization problem simpler by reducing the number of parameters 
involved. Then the usual analytic techniques can be used in a more effective 
way. 

The paper is organized as follows: in Section [2] we introduce some def- 
inition an we recall some basic facts. In Section [3] we study the problem 
of parameters redundancy form a geometric point of view. In Section |4] we 
show a possible application of our results. 
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2 Definition and background material 



Let P = (pij) be a probability matrix with / rows and J columns, i.e. 
P € A. In order to simplify the formulae, let us suppose that I < J. Let k 
be an integer, 1 < k < I. 

Definition 2.1. A probability matrix P is the mixture of k independence 
models if it can be written in the form: 

P = a\c\r\ + . . . + a k c k r\ (3) 

where for all h = 1, . . . , k 

• a h £R + and ^ h a h = 1; 

• rieKj and ]T\ r h(i) = 1/ 

• Ch € and Yjj Ch{j) = 1. 

Definition 12.11 contains a simple parametric form of the probability dis- 
tribution which has an intuitive probabilistic counterpart. Let us suppose 
that we have k pairs of dice, say (Di >r , Di >c ), ■ ■ ■ , {Dk, T , Dk, c ), where D^r 
has J facets and distribution and P>h,c bas / facets and distribution c^. 
We choose a pair of dice with probability distribution a = (a%, . . . and 
we roll the selected pair of dice. The resulting distribution is just a mixture 
distribution as in Eq. ([3|). 

As a Linear Algebra counterpart, the definition above is strictly related 
with the notion of non-negative rank o f a ma trix. For more on non-negative 
rank see, e.g., ICohen and Eothbluml (jl993l b We recall here some useful 
facts. 

Definition 2.2. Given a matrix P with real non-negative elements, the non- 
negative rank of P is the smallest number of non-negative column vectors 
vi, . . . ,Vk of P such that each column of P has a representation as a linear 
combination of v\, . . . ,v k with non-negative coefficients. The non-negative 
rank of a matrix P is denoted with rk + (P) . 

The definition above has an equivalent formulation in terms of linear 
combinations of row vectors. In the following proposition we summarize 
the main properties of the non-negative rank. The reader can refer to 
Cohen and Rothbluml (|l99.^ for proofs and further details. The non-negative 



rank is of special relevance for Probability and Statistics. In fact, rk + (A) is 
the number of dyadic products of non-negative vectors that we can use to 
represent A. 
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Proposition 2.3. Let P , Q be two non-negative matrices with I rows and 
J columns. 



(a) rk(P) < rk+(P) < min{/, J}; 

(b) rk+(P) = rk+(P'); 

(c) rk+(P + Q) < rk+(P) + rk+(Q). 

Moreover, if P has dimensions I x K and Q has dimensions K x J, then 
rk+(PQ) < min{rk+(P),rk + (Q)}. 



Items (b) — (d) in Proposition 12.31 show that the non-negative rank has 
properties similar to the classical rank. In general, the rank and non-negative 
rank are different, as shown by the following matrix 



1 0\ 
10 1 
110 
\0 1 1/ 



which has rank 3 but non-negative rank 4. 

Among the cases where the rank and the non-negative rank coincide, 
there are the following special classes of matrices Cohen and Rothblum ( 19931 ). 

Proposition 2.4. Let P be a non-negative matrix with L rows and J columns. 

(a) 7/rk(P) < 2 then rk+(P) = rk(P); 

(b) If P is diagonal, then rk + (P) = rk(P). 

In what follows we will heavily use part (a) of Proposition 12.41 Hence, 
for the convenience of the reader, we produce a self contained proof of this 
fact for probability matrices. 

Lemma 2.5. Let P be a probability matrix. If rk(P) < 2, then rk + (P) = 
rk(P). 

Proof. If rk(P) = 1 then the proof is trivial; thus we will assume rk(P) = 2. 
Denote with Ci,i = 1 . . . , J the columns of P. We will show that there 
exist two columns, say C and C, such that C\ = tiC + SjC for all i and the 
coefficients tj's and Sj's are non- negative. 

Clearly, as P has rank at most two, all columns are linear combinations of 
two fixed ones. Without loss of generality, we may assume that C\ and C2 are 
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linearly independent. Thus for any other column we have Cj = t{C\ + SjC^. 
If all the pairs (ti,Si) are non-negative we are done. Otherwise, consider in 
the plane M 2 the rays spanned by the pairs (U,Si) and let (t, s) and (f, s) 
be the extremal rays and denote by C and C the corresponding columns. 
We recall that the extremal rays are the minimal generators of the convex 
cone spanned by the the pairs (ij, Sj). Now consider the angle <f> between the 
extremal rays containing at least one positive semi-axis. If <f> < tt radiants 
then we are done by using the addition rule for vectors in the plane and 
all the columns are non-negative linear combinations of C and C. If <j) = tt 
radiants we get the contradiction as C + C = and hence C\ and C2 
would be proportional. If <p > tt we get again a contradiction. In fact, 
a non-negative combination of the extreme rays would be in the negative 
quadrant. Hence, a non- negative linear combination of C and C would be 
non-positive and hence equal to zero being P non-negative. Thus, C\ and 
C2 would be proportional again. □ 



3 Parameters and parameterizations 

Often in Probability and in Statistic models are described using parame- 
ters. This description can be easily expressed in geometric terms. Given 
the variety representing the model we look for a surjective function into it. 
More precisely, if Ai is the model, a surjective function U C R n — > Ai 
gives a parametrization of Ai. If the function we found is described by ra- 
tional functions and its image is dense in the model, we say that the map is 
dominant and we describe the model up to a measure zero set. 

Given a model Ai there are two basic questions: Does there exist a dom- 
inant map W 1 — > Ail What is the smallest n for which such a map exists? 
Answering the first question is a deep a nd difficult p roblem in Geometry 
called "the unirationality problem", see (jHarrisl . fl992 . page 87). The sec- 



ond question is difficult too, but we can easily give a bound on n using the 
dimension of Ai, namely we must have n > dimM. 

When we have a parametrization of a model Ai such that n = dimA^ 
we say that the parametrization is non-redundant, or that the parame- 
ters are non-redundant. It is not always possible to find a non-redundant 
parametrization. But, in some interesting situations, it is possible to decom- 
pose the model Ai as union of subvarieties and for each of this one can find a 
non-redundant parametrization. We will give examples of these phenomena 
in the case of rank k and rank 2 mixture models. 
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3.1 A parametrization for the rank k matrices 

Given natural numbers I < J we consider the following family of matrices 
with rank at most k: 



M k =\p= Cp M ) £l /xJ : rk(P) < k , Y,Pi,i = 1 
^ »J 

As the elements of M k have rank at most k, they can be written as 
a linear combination of at most k rank one probability matrices. More 
precisely, if P S M k then 

P = axarl + ... + a k c k r\ (4) 

for a choice of scalars a^s and of column vectors q's and r^'s. Hence, we can 
represent elements of M. k using 

k(I + J) + k 

parameters. In other words, gives a surjective polynomial map 

R k(i+j)+k — > Mk 

We recall that a map between algebraic varieties, say V\ — > V2, can be a 
parametrization, only if dimFi > dimV2. To know whether the parameters 
we are using are necessary or redundant, we need to know the dimension of 
Mk and compare it with k(I + J) + k. 

Proposition 3.1. With the notation above, we have 

dimM k < k(I + J) - k 2 - 1. 

Proof. The dimension of the family of compl ex I x J ma trices of rank at 
most k is well known to be k(I + J) — k 2 , see Harris! (1992). Imposing that 



the sum of al the entries is 1 and taking real matrices give the bound. □ 

Proposition 13.11 shows that the parametrization @ is redundant and we 
are using more parameters than the best possible value. Actually, it is not 
possible to use k(r + s) — k 2 — 1 parameters to get all the elements of M k - 
In the case of k = 2 we will show how to decompose in open subsets 
which can each be described using the optimal number of parameters. 
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3.2 Non-redundant parameterizations of probability models 

for k = 2 



In this section we only deal with matrices of rank at most two. Hence we 
fix k = 2 and we set 



M =Mr 



-{ 



P € 



t IxJ 



rk 



+ 



(P) = 2} 



n A. 



In this situation, dim.A/( < 21 + 2 J — 5 and we will use this number of 
parameters to describe Ai, hence finding a non-redundant parametrization. 
Set D = 21 + 2J — 5. We will construct maps 



f 31,32 '■ ^ 31,32 



M 



for 1 < ji < j'2 < J, with the property that the union of the images of the 
fj u j 2 is the whole M, i.e. \Jlm(fj 1 j 2 ) = M. 

Each map is constructed in such a way that Iva.{fj 1) j 2 ) is contained in the 
open subset of the matrices with the ji-th and the j'2-th columns linearly 
independent. We give an explicit description only for the other cases 
being completely analogous. We set 



71,2(01, . . . , a/_i, 63, . . . , bj, ci, . . . , c/_i, d 3 



= a 



01 

02 



( i-E&i h 



,dj,a) 



bj ) + 



OJ-I 
C2 



+ (1 



a 



C/-1 



( l-^dj d 3 ... dj ), 



defined on 

U'1,2 = {(«!,• 



,07-1,63, . . . ,bj,ci, . . . ,C7_i,d 3 , ... ,dj,a) G 



< di,a<l and < ^ a«, ^ 6 i; ^ q, ^ ^ < l| . 

To define fj u j 2 one simply moves element in the row vectors. In the first 
row vector the 1 — ^ b% element is moved in position j\ and the is moved 
in position j 2 ; similarly for the second row vector. 



8 



Remark 3.2. With standard computations one can easily check that 

Im(/ jlJ2 ) C M 

for all ji and 32, ji < h- 

Now we analyze the functions fj u j 2 i n order to derive some useful prop- 
erties. We work with f\^ and all the results trivially extend to the other 
functions. 

Lemma 3.3. Let P £ M. be the following matrix 

( x 1 y 1 ... Uxi + so/I . . . tjxi + s J y 1 \ 



P = 



\ xi yi ... Uxi + Siyi . . . tjxi + sjyi J 

where the coefficients Xi,yi,Si and ti are non-negative. 
If the first two columns of P are non-zero, we set 



Xi 

Oi — , Ci 



Vi 



E^' E: 



-,h = 



,di = 



and also a = (E U + 1) E x i = 1 ~ (E s i + 1) E Vi- 



and also a = 1, and Ci = di = for all i. 
IfJ2 x i = 0; we se t 



in 



Si 



and also a = and a« = hi = /or a// i. 

If we set P' = (ai, . . . , a/_i, 63, ... , bj, a,... ,ci-i,ds, ...,dj, a), then 
P' eU' 12 and 

h,2{P') = P. 

Proof. The definition of P' and the condition on the entries of P yield that 
P' e ^1,2- A straightforward computation shows that fi^(P') = P- The 
two expressions for the parameter q coincide as P is a matrix with sum 
one. □ 
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Finally we can show that the maps fj lt j 2 & ve a parametrization of A4. 

Corollary 3.4. The variety M is covered by the images of the functions 
fjid2i more precisely \Jlm(f juh ) = M. 

Proof. Let P E M, by Lemma 12.51 we know that P can be written as in 
the statement of Lemma 13.31 for some columns Cj 1 and Cj 2 and hence P € 



4 An application 

It is often interesting to find maxima and minima of a function over a variety. 
As an example consider the well known likelihood function. We will use the 
parametrization we found in the previous sections to propose a strategy to 
study extremal points on M. The advantage of this approach is that we are 
going to study functions involving the least possible number of variables as 
the parametrization we found is non-redundant. 

Remark 4.1. Given a function F : A4 — ► K we consider the composite 
functions F o fj u j 2 . Consider a point P = fj 1 j 2 (P') £ -M. such that P is 
in the interior oflm(fj 1 j 2 ). Then P is a maximum/minimum for F if and 
only if P' is a maximum/minimum for F o fj 1 j 2 . 

Using Remark 14.11 we can apply the usual gradient and Hessian matrix 
approach to detect extremal points of F lying in the interior of one of the 
Iva(fj 1 j 2 ). Hence it useful to have the following: 

Lemma 4.2. If P 1 is in the interior ofU'^ - 2 then fj 1 j 2 (P') * s * n ^ e interior 
ofIm(f juj2 ). 

Proof. We produce a proof for j\ = 1 and ji = 2 but a completely analogous 
argument works in the general situation. Given P' we compute P = fi2(P') 
and thus we write P as in the statement of Lemma 13.31 Moreover, as P' 
is in the interior of U[ 2 the coefficients ij and Sj in P are strictly positive. 
Now consider a neighborhood U of P. Given a matrix Q G U we can write 
it in the form of Lemma 13.31 by computing the coefficients ij and the Sj. 
This is done by solving linear systems of equations having the elements of 
Q as coefficients. Hence, it is possible to choose a suitable U such that 
for all the matrices in U the coefficients ij and Sj are strictly positive. In 
conclusion, the formulae of Lemma 13.31 produce a map gi^ '■ U — > U[ 2 - It 
is straightforward to see that g\ t 2 is a continuous map on U and that the 
map 

/l,2 °5l,2 



10 



is the identity map. Now we take a neighborhood of P', say U' C f 1 %(U). 
Then we get a neighborhood of P 

g^(U')clm(f 1<2 ) 

and we are done. □ 

Lemma 14.21 shows that we only have to worry about points of M which 
are images of boundary points of Uj 1 j 2 . Thus it is useful to have the following 
description: 

Lemma 4.3. Let P' G - 2 be the point 

P' = (01, . . . ,aj_i,6 3 ,.. . ,bj,ci, . . . ,c/_i,d 3 , ... ,dj,a) 
and let P = fjij 2 (P')- Then the following hold: 

1. if any of the coefficients a{ or Ci is zero then P is a point of the bound- 
ary of M.; 

2. if a i = 1 or = 1 then P is a point of the boundary of Ai ; 

3. if a = or a = 1 then P is a rank one matrix; 

4- if any of the coefficients bi or d{ is zero then is P has at least two 
proportional columns; 

5. if '^2 a i = 1 or Yl bi = 1 then P has at least two proportional columns. 

Proof. For ([1]) and ([2]) it is enough to notice that P has some zero ele- 
ment. Hence a neighborhood of P contains matrices with negative entries. 
Thus P is on the boundary of A4. The other cases are obtained by direct 
computations. □ 

By Lemma 14.31 we see that the composite map F o fj lt j 2 will detect 
maxima and minima of F if these extremal points do not have rank one or 
if they have rank two and do not have two proportional columns. In many 
situation of interest rank one matrices can be efficiently treated, e.g. for the 
likelihood function. Rank two matrices with proportional columns can be 
treated using our parametrization in a subtler way. 

Lemma 4.4. Let P = fj 1 .j 2 (Pj 1 j 2 ) ^ e a ran k two matrix with at least two 
proportional columns. Then a neighborhood of P in M can be covered using 
images of neighborhoods of Pj lt j 2 in Uj 1 j 2 for different pairs (j'1,,72). 
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Proof. Given P, choose two independent columns, say the ji-th and the 
j2-th. As P has proportional columns, when written as in Lemma 13.31 some 
of the coefficients ti and Sj vanish. Hence, in each neighborhood of P there 
will be matrices requiring negative values of the coefficients ti or Sj. Then 
there is no neighborhood where the formulae of the Lemma can be applied 
to get and inverse on fj 1 j 2 and hence we can not reproduce the argument of 
Lemma H2J But we can find a neighborhood of Pj 1 j 2 , say Wj i j 2 C Uj 1 j 2 , 
such that there exists an inverse of fj 1 j 2 on fh&v^'h j 2 )> but * ms * s no * 
a neighborhood of P. By Lemma 1231 we see that the fj 1 ,j 2 (^j 1 j 2 ) cover a 
neighborhood of P as (j'1,,7'2) varies and we are done. □ 

We can now describe our strategy. Given a function F : A4 — ► R we 
can look for maxima and minima of F in following way: 

1. study F on rank one matrices using an ad hoc method. W hen F is 



the lik elihood function, the problem is quite simple, see e.g. lAgresti 
(|2002h : 



2. consider the functions Fofj l j 2 and compute their maxima and minima 
on - 2 for all 1 < ji < 32 < J (notice that these computation are as 
simple as they could be as the least number of variable is involved); 
let Q be one of the point we found; 

3. if Q is in the interior of one of the - 2 then fj lt j 2 (Q) is a maximum 
or minimum of F; 

4. if Q lies on the boundary of one of the - 2 and fj 1 j 2 (Q) is on the 
boundary of M, then fj 1 j 2 {Q) is a maximum or minimum of F; 

5. if Q lies on the boundary of one of the j a and fj 1 j 2 (Q) has rank 
one we already treated this case in the first step; 

6. if Q lies on the boundary of one of the • and fj 1 ,j 2 (Q) has two 
proportional columns, then Q will lie on the boundary of at least two 
of the U', - a ; for each each pair (ji , ) such that Q is on the boundary 
of J7j 1 - 2 we have to compare the extremal behavior of the functions 
Fo/jjjj, if these behavior agree then fj 1 j 2 (Q) is a maximum/minimum 
of F otherwise it is not. 

In this paper we only considered matrices of rank at most two. For 
higher values of the rank the situation gets much more involved and almost 
impossible to treat. For example, it is not even known how to effectively 
compute the non- negative rank of a matrix. But, some preliminary results 
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in lDong et al. 1 420091 ) suggest that matrices with non-negative rank different 
from the ordinary rank are exceptional, i.e. they form a zero-measure set. 
This observation can be of some help to try and extend our approach. 
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